PHONEME DIVIDING METHOD 
USING MULTILEVEL NEURAL NETWORK 



Background of the Invention 

The present invention relates to a phoneme dividing method 
using a multilevel neural network. 

Conventional phoneme dividing technologies complicate their 
systems by finding the border of phonemes through an analysis 
that applies various phonetic knowledge and rules after 
extracting the frequency component, that is, the spectrogram, 
from an acoustic signal. 

Without an effective and optimal method for combining the 
various knowledge and rules used in phoneme division, the 
performance of the system is not reliable and deteriorates 
drastically depending upon the situation. 

There is also a method for finding the border of phonemes by 
comparing characteristic patterns with an incoming signal in 
phoneme division, after previously extracting the characteristics 
of all phonemes and storing them as patterns. This method 
requires information on the characteristic patterns for all 
phonemes, which undesirably increases both the memory volume of 
the system and the amount of calculation during operation. 



Summary of the Invention 

Therefore, it is an object of the present invention to 
provide a phoneme dividing method using a multilevel neural 
network for precisely and efficiently capturing the point of the 
phoneme border, using only the variation of the vocal signal 
appearing at the border of phonemes, without additional knowledge 
of the phoneme itself, to be thereby utilized in application fields 
requiring automatic phoneme division or phoneme labeling. 

To accomplish the object of the present invention, there is 
provided a phoneme dividing method using a multilevel neural 
network applied to a phoneme dividing apparatus having a voice 
input portion for outputting a vocal sample digitally converted 
from voice, a preprocessor for extracting a characteristic 
vector suitable for phoneme division from the vocal sample input 
from the voice input portion, a multi-layer perceptron (MLP) 
phoneme dividing portion for finding and outputting the border 
of phonemes using the characteristic vector of the preprocessor, 
and a phoneme border outputting portion for outputting position 
information on the phoneme border of the MLP phoneme dividing 
portion in the form of frame position, the method comprising the 
steps of: (a) sequentially segmenting and framing voice with 
digitalized voice samples, extracting characteristic vectors by 
vocal frames, and extracting an inter-frame characteristic vector 
of the difference between nearby frames of the characteristic 
vectors by frames, to thereby normalize the maximum and minimum 
of the characteristics; (b) initializing weights present between 
an input layer and a hidden layer and between the hidden layer and 
an output layer of the MLP, designating output target data of the 
MLP, inputting the characteristic vectors to the MLP for 
learning, and, if the reduction rate of the mean squared error 
converges within a permissible limit, storing the weights 
obtained through learning and information on the standard of the 
MLP and finishing the learning; and (c) reading the weights 
obtained in step (b), receiving the characteristic vectors, 
performing an operation of phoneme border discrimination to 
generate an output value, discriminating the phoneme border 
according to the output value, and, if the currently analyzed 
frame reaches the frame two frames preceding the final frame of 
the incoming voice, outputting a frame number indicative of the 
border of phoneme as a final result. 




Brief Description of the Attached Drawings 

FIG. 1 is a block diagram of a system to which the present 
invention is applied; 

FIG. 2 shows a configuration of a multilevel neural network 
used for the present invention; and 

FIG. 3 is a flowchart of one embodiment of the present 
invention. 

Detailed Description of Preferred Embodiment 

Hereinafter, a preferred embodiment of the present invention 
will be described. 

In FIG. 1, reference numeral 1 represents a voice input 
portion, reference numeral 2 a preprocessor, reference numeral 3 
a multi-layer perceptron (MLP) phoneme dividing portion, and 
reference numeral 4 a phoneme border output portion. 

Voice input portion 1 comprises a microphone for converting 
an aerial vocal waveform into an electric vocal signal, a band- 
pass filter for eliminating low-frequency noise and high- 
frequency aliasing from the vocal signal input as an electric 
analog signal, and an analog-to-digital converter (ADC) for 
converting the analog vocal signal into a digital vocal signal. 
The voice input portion outputs the vocal samples digitally 
converted from the voice to preprocessor 2. 

Preprocessor 2 extracts characteristic vectors suitable for 
phoneme division from the vocal samples input from voice input 
portion 1, and outputs them to MLP phoneme dividing portion 3. 
MLP phoneme dividing portion 3 finds the border of phonemes, using 
the characteristic vectors input from preprocessor 2, and outputs 
the result to phoneme border output portion 4. Phoneme border 
output portion 4 outputs position information on the phoneme 
border automatically divided in MLP phoneme dividing portion 3 in 
the form of a frame position. 

Referring to FIG. 2, one embodiment of the present invention 
implements an effective and reliable automatic phoneme segmenter 
by using a multi-layer perceptron (MLP), one kind of neural 
network, in order to overcome the drawbacks of the conventional 
phoneme dividing method based upon knowledge or rules. 

A phoneme dividing method using an MLP is well suited to 
avoiding the decrease in performance caused by imperfect modeling 
of knowledge or rules on the border of phonemes contained in a 
vocal signal. In this method, the functions required in phoneme 
division are learned by the MLP itself from the characteristic 
vectors extracted from a large amount of vocal data, so that the 
MLP finds the knowledge or rules contained in the vocal signal 
without specific suppositions, rules or knowledge on the phoneme 
border being introduced in advance. Accordingly, the method of 
the present invention eliminates the introduction of uncertain 
suppositions, as well as the additional processing of the 
distribution of the vocal signal that would otherwise be needed 
to facilitate its modeling. 

The MLP used in the present invention has a three-layer 
structure consisting of input, hidden and output layers. As 
shown in the drawing, the input layer placed at the bottom has 
73 input nodes: 72 input nodes for the inter-frame 
characteristic vectors extracted from the four inter-frame 
differences generated among five sequential frames, and one input 
node for a constant input value 1, which is used instead of the 
threshold value comparison process in the hidden layer of the 
MLP. 

The output layer has two nodes: a first node indicating that 
a frame is the border of phoneme, and a second node indicating 
that it is not. The hidden layer placed between the input and 
output layers performs the nonlinear discrimination that the MLP 
must implement. 

The following nonlinear sigmoid function is used for the 
activation function of the hidden layer. 

y = (exp(x) - 1) / (exp(x) + 1) 



where x and y represent the input and output of the activation 
function, respectively. 
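
By way of illustration only, the activation function above may be 
sketched in Python as follows. The function names sgmod and 
sgmod_direct are assumptions made for this example (the first is 
borrowed from the SGMOD notation used later in this description). 
The expression (exp(x) - 1) / (exp(x) + 1) is algebraically equal 
to tanh(x/2), which is used here because it does not overflow for 
large inputs. 

```python
import math


def sgmod(x):
    """Bipolar sigmoid y = (exp(x) - 1) / (exp(x) + 1); output lies in [-1, 1]."""
    # (e^x - 1) / (e^x + 1) equals tanh(x / 2), which is safe for large |x|.
    return math.tanh(x / 2.0)


def sgmod_direct(x):
    """Literal form of the formula; exp(x) overflows for very large x."""
    e = math.exp(x)
    return (e - 1.0) / (e + 1.0)
```

Unlike the logistic sigmoid, this function is symmetric about 
zero, which matches the use of 1 and -1 as target values. 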

The number N of nodes of the hidden layer is known to be 
closely related to the final performance of the MLP. Experiments 
using various kinds of data indicate that an appropriate number 
of nodes is between 10 and 30. 

Between the input layer and hidden layer and between the hidden 
layer and output layer, there are weights which connect all the 
nodes of the respective layers. Because the weights connect all 
the nodes between the layers, their number is 73 x N (the number 
of input nodes x the number of hidden nodes) between the input 
layer and hidden layer, and N x 2 (the number of hidden nodes x 
the number of output nodes) between the hidden layer and output 
layer. These weights are obtained in advance through learning 
using an error back propagation algorithm, stored in a memory, 
and then read out in phoneme division. 

FIG. 3 shows the procedure of the phoneme division algorithm 
in preprocessor 2 and MLP phoneme dividing portion 3, which has 
two parts: a learning process and a dividing process of the MLP 
phoneme dividing algorithm. 

First of all, the process of voice framing and characteristic 
vector extraction is performed in preprocessor 2 and used 
commonly in the learning and dividing processes. In selecting the 
characteristic vectors in the present invention, factors 
explicitly indicative of the difference of spectrum between 
frames are introduced in order to use the fact that the variation 
of a vocal spectrum is severe at the border of phonemes. 

Voice is sequentially segmented into lengths long enough to 
extract the voice characteristics from the digitalized voice 
samples, for the purpose of voice framing in step 10. Voice 
framing is performed by taking Hamming windows of 16 msec every 
10 msec with respect to the overall incoming vocal samples. 
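
By way of illustration only, the voice framing of step 10 may be 
sketched as follows. The sampling rate of 16 kHz is an assumption 
(the description states only that energy is analyzed up to 8 kHz), 
and the function names hamming and frame_voice are hypothetical. 
The window uses the standard Hamming definition. 

```python
import math


def hamming(length):
    """Standard Hamming window: w(n) = 0.54 - 0.46 * cos(2*pi*n / (length - 1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def frame_voice(samples, sample_rate=16000, win_ms=16, hop_ms=10):
    """Cut samples into Hamming-windowed frames of win_ms taken every hop_ms.

    sample_rate is an assumed value; the description does not state it.
    """
    win = sample_rate * win_ms // 1000   # 256 samples at 16 kHz
    hop = sample_rate * hop_ms // 1000   # 160 samples at 16 kHz
    window = hamming(win)
    frames = []
    # Only complete frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(samples) - win + 1, hop):
        frames.append([samples[start + n] * window[n] for n in range(win)])
    return frames
```

With these values consecutive frames overlap by 6 msec, so each 
vocal sample near a frame edge is also covered by a neighboring 
frame. 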

Then, the characteristic vectors are extracted from the 
voice frames in step 11, which contains two substeps. In the 
first substep, characteristic vectors by frames that effectively 
indicate the characteristics of voice are extracted on the basis 
of phonetic knowledge, with respect to the respective voice 
frames obtained before. In the second substep, inter-frame 
characteristic vectors of the difference between nearby frames, 
with respect to the characteristic vectors by frames obtained in 
the first substep, are extracted to be used as the final 
characteristic vectors input to MLP phoneme dividing portion 3. 

To describe the above procedure in more detail, the 
characteristic vectors first obtained with respect to the 
respective frames are as follows. 

(1) frame energy: indicates the intensity of phonation by 
frames and is found according to the following equation. 

ENG_FRM(t) = log10( Σ_n s(n) x s(n) ), n = 0, 1, ..., N 

where s(n) represents a vocal sample belonging to the t th frame, 
and N represents the length of the vocal frame. 
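
By way of illustration only, the frame energy may be computed as 
sketched below. The function name eng_frm and the floor parameter 
are assumptions for this example; the floor is a guard that is 
not part of the description and merely keeps the logarithm 
defined for a silent frame. 

```python
import math


def eng_frm(frame, floor=1e-12):
    """Frame energy ENG_FRM(t) = log10(sum of s(n) * s(n)) over one frame.

    floor is an assumed guard (not in the description) for all-zero frames.
    """
    energy = sum(s * s for s in frame)
    return math.log10(max(energy, floor))
```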

(2) 16th degree Mel-scaled fast Fourier transform (FFT): 
First, the FFT is performed in order to obtain the spectrum, that 
is, the frequency characteristic of voice by frames, and the 
frequency component of voice is classified into 16 predetermined 
frequency bands similar to the human hearing characteristics, to 
obtain the 16th degree energy by bands, which is used as the 
coefficients of the Mel-scaled FFT. The j th degree Mel-scaled FFT 
coefficient for frame index t is obtained as follows. 

MSFC(j,t) = log10( Σ_f s(j,t,f) ) 

where f represents the frequency belonging to the respective 
frequency bands; 
j is the index of the respective frequency bands; and 
s(j,t,f) is the j th degree frequency band amplitude spectrum of 
the t th frame obtained from the FFT by frequencies. 
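
By way of illustration only, the band energies may be sketched as 
follows. The description says only that the 16 bands are 
predetermined and similar to human hearing; the band edges used 
here (equal spacing on a common mel scale formula), the 16 kHz 
sampling rate, the floor guard and all function names are 
assumptions for this example. A naive discrete Fourier transform 
stands in for the FFT to keep the sketch self-contained. 

```python
import cmath
import math


def amplitude_spectrum(frame):
    """Magnitude of the DFT for bins 0..len(frame)//2 (naive O(n^2) transform)."""
    n = len(frame)
    return [abs(sum(frame[k] * cmath.exp(-2j * math.pi * b * k / n)
                    for k in range(n)))
            for b in range(n // 2 + 1)]


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(num_bands=16, max_hz=8000.0):
    """Band edges in Hz, equally spaced on the mel scale (assumed layout)."""
    top = hz_to_mel(max_hz)
    return [mel_to_hz(top * i / num_bands) for i in range(num_bands + 1)]


def msfc(spectrum, sample_rate=16000, num_bands=16, floor=1e-12):
    """MSFC(j, t) = log10(sum over f in band j of s(j, t, f)), j = 0..num_bands-1."""
    bin_hz = sample_rate / (2.0 * (len(spectrum) - 1))
    edges = mel_band_edges(num_bands, sample_rate / 2.0)
    coeffs = []
    for j in range(num_bands):
        lo, hi = edges[j], edges[j + 1]
        last = (j == num_bands - 1)
        # Each bin falls in exactly one band; the top band also takes the
        # Nyquist bin in case rounding places it just above the last edge.
        total = sum(a for b, a in enumerate(spectrum)
                    if lo <= b * bin_hz < hi or (last and b * bin_hz >= hi))
        coeffs.append(math.log10(max(total, floor)))
    return coeffs
```

In practice a library FFT would replace amplitude_spectrum; only 
the grouping of amplitude bins into 16 bands and the logarithm 
correspond to the equation above. 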

(3) energy ratio by bands: It is very important in phoneme 
division to precisely discriminate phonemes into voiced sounds 
and voiceless sounds. Voiced and voiceless sounds differ in the 
distribution of energy over the frequency bands. In order to 
discriminate voiceless and voiced sounds in the present 
invention, the low-frequency energy between 0 and 3 kHz and the 
high-frequency energy between 3 and 8 kHz are obtained 
respectively, and their ratio is selected as one of the 
characteristic vectors. 

ENG_RTO(t) = log10(ENG_LOW(t)) - log10(ENG_HIGH(t)) 
ENG_LOW(t) = Σ_f s(f,t), f = 0, ..., 3 kHz 
ENG_HIGH(t) = Σ_f s(f,t), f = 3 kHz, ..., 8 kHz 

where ENG_LOW(t) and ENG_HIGH(t) are the energies of the low and 
high frequency bands of the t th voice frame, respectively, which 
are obtained by the sum of the components contained in the 
respective bands of the amplitude spectrum obtained in the FFT. 
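
By way of illustration only, the energy ratio may be sketched as 
follows. The function name eng_rto, the 16 kHz sampling rate and 
the floor guard are assumptions for this example. Because both 
ranges in the equations name 3 kHz as an end point, the sketch 
assigns the 3 kHz bin to the high band, which is likewise an 
assumption. 

```python
import math


def eng_rto(spectrum, sample_rate=16000, split_hz=3000.0, floor=1e-12):
    """ENG_RTO(t) = log10(ENG_LOW(t)) - log10(ENG_HIGH(t)).

    spectrum holds the FFT amplitudes for bins 0..Nyquist of one frame.
    """
    bin_hz = sample_rate / (2.0 * (len(spectrum) - 1))
    low = sum(a for b, a in enumerate(spectrum) if b * bin_hz < split_hz)
    high = sum(a for b, a in enumerate(spectrum) if b * bin_hz >= split_hz)
    return math.log10(max(low, floor)) - math.log10(max(high, floor))
```

A frame whose amplitude is concentrated below 3 kHz, as is typical 
of voiced sounds, yields a positive value, and a frame dominated 
by high-frequency components yields a negative value. 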

The inter-frame characteristic vectors used as the final 
input of MLP phoneme dividing portion 3 can be obtained by 
finding the difference between nearby frames with respect to the 
first characteristic vectors by frames, on the basis of the fact 
that the variation of the vocal spectrum occurs at the border of 
phonemes. 

(1) difference of frame energy between nearby frames 
dENG_FRM(t) = |ENG_FRM(t) - ENG_FRM(t-1)| 

(2) inter-frame difference of 16th degree Mel-scaled FFT 
dMSFC(j,t) = |MSFC(j,t) - MSFC(j,t-1)|, j = 0, 1, ..., 15 

Here, j represents the respective degrees of the coefficients. 

(3) inter-frame difference of energy ratio by frames 
dENG_RTO(t) = |ENG_RTO(t) - ENG_RTO(t-1)| 
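
By way of illustration only, the three differences above may be 
computed together by treating the frame energy, the 16 Mel-scaled 
FFT coefficients and the energy ratio of one frame as a single 
vector of 18 components. The function names and the ordering of 
the components are assumptions for this example. 

```python
def frame_vector(eng_frm, msfc, eng_rto):
    """Concatenate the per-frame features into one 18-component vector."""
    return [eng_frm] + list(msfc) + [eng_rto]


def inter_frame_difference(current, previous):
    """Component-wise |x(t) - x(t-1)| between two per-frame vectors."""
    if len(current) != len(previous):
        raise ValueError("frame vectors must have the same length")
    return [abs(c - p) for c, p in zip(current, previous)]
```

Four such 18-component difference vectors account for the 72 
input nodes of the MLP. 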

After the characteristic vectors are extracted as above, 
they are normalized in step 12 so that their maximum and minimum 
become 1 and -1, respectively, in order to be used as the input of 
MLP phoneme dividing portion 3. 
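
By way of illustration only, the normalization of step 12 may be 
sketched as a linear rescaling. The description does not state 
whether the maximum and minimum are taken per component or over 
what data; this example assumes they are taken per component over 
all supplied vectors, and the function name normalize is 
hypothetical. 

```python
def normalize(vectors):
    """Linearly rescale each component so its minimum maps to -1 and maximum to 1."""
    dims = len(vectors[0])
    lows = [min(v[d] for v in vectors) for d in range(dims)]
    highs = [max(v[d] for v in vectors) for d in range(dims)]
    scaled = []
    for v in vectors:
        row = []
        for d in range(dims):
            span = highs[d] - lows[d]
            # A constant component carries no information; map it to 0.
            row.append(0.0 if span == 0 else 2.0 * (v[d] - lows[d]) / span - 1.0)
        scaled.append(row)
    return scaled
```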

In the learning process of MLP phoneme dividing portion 3 
using the normalized characteristic vectors, the weights present 
between the input and hidden layers and between the hidden and 
output layers are initialized in step 13 as the initial learning 
step of MLP phoneme dividing portion 3. Each initial value is set 
to an arbitrary value distributed between -1 and 1. 

After this step, the output target data of the output layer, 
which teaches the MLP to find the border of phonemes, is 
designated in step 14. The output target data for each frame has 
as many components as the MLP has output nodes, having values of 
(1,-1) in case of the border of phoneme and (-1,1) in other 
cases. This output target data is made to coincide with the frame 
position of the corresponding characteristic vectors, using 
information on the phoneme border obtained from a previously 
phoneme-divided voice database. 
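
By way of illustration only, the designation of output target 
data in step 14 may be sketched as follows. The function name 
make_targets and the representation of the labeled database as a 
list of border frame numbers are assumptions for this example. 

```python
def make_targets(num_frames, border_frames):
    """Return one (border, non-border) target pair per frame."""
    borders = set(border_frames)
    return [(1, -1) if t in borders else (-1, 1) for t in range(num_frames)]
```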

After the designation of the output target data, the 
characteristic vectors, which are the learning data, are input to 
the input layer of the MLP in step 15 so as to teach the MLP in 
step 16. The input layer has 73 nodes: 72 input nodes for the 
input of the four sequential inter-frame characteristic vectors 
and one input node for the value 1, which is input instead of the 
threshold value comparison procedure of the hidden layer. 

The four inter-frame characteristic vectors are extracted 
from the four intervals generated among five frames, namely the 
two preceding and two succeeding frames t-2, t-1, t+1, t+2, 
centering on the currently analyzed frame t, as shown in the 
lower portion of FIG. 2. The learning algorithm of the phoneme 
dividing MLP is the generally used error back propagation 
algorithm. 
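
By way of illustration only, the 73-value input for one analyzed 
frame may be assembled as sketched below. The function name 
mlp_input, the chronological ordering of the four vectors, and 
the convention that diffs[u] holds the normalized 18-component 
difference between frames u-1 and u are assumptions for this 
example. 

```python
def mlp_input(diffs, t):
    """Build the 73-value MLP input for the currently analyzed frame t.

    diffs[u] is the 18-component inter-frame vector between frames u-1 and u,
    so the four intervals among frames t-2..t+2 are diffs[t-1]..diffs[t+2].
    """
    values = []
    for u in range(t - 1, t + 3):
        values.extend(diffs[u])
    values.append(1.0)  # constant input that replaces the hidden-layer threshold
    return values
```

Under this convention the first frame that can be analyzed is 
t = 2, since diffs[1] is the earliest defined difference. 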

After this learning process of the MLP, if the reduction rate 
of the mean squared error converges within a permissible limit in 
step 17, the weights obtained through learning and information 
on the standard of the MLP are stored in step 18 to finish the 
learning process. After the learning process, the voice is 
sequentially segmented into lengths long enough to extract the 
voice characteristics from the digitalized vocal samples for 
voice framing in step 10, and the characteristic vectors are 
extracted in step 11 and normalized in step 12. 
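
By way of illustration only, the convergence check of step 17 may 
be sketched as follows. The description gives neither the exact 
definition of the reduction rate nor the permissible limit; this 
example assumes a relative reduction between two successive 
learning passes and a hypothetical limit of 1e-4. The function 
names are likewise assumptions. 

```python
def mean_squared_error(outputs, targets):
    """Mean of the squared output errors over all frames and output nodes."""
    total = sum((o - t) ** 2
                for out, tgt in zip(outputs, targets)
                for o, t in zip(out, tgt))
    return total / (len(outputs) * len(outputs[0]))


def has_converged(previous_mse, current_mse, limit=1e-4):
    """True when the relative change of the MSE between passes is within limit."""
    if previous_mse == 0:
        return True
    reduction_rate = (previous_mse - current_mse) / previous_mse
    return abs(reduction_rate) <= limit
```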

The weights obtained in the learning process are read into 
the MLP in step 19. Then, the 72 characteristic values obtained 
in the above process are input in the sequence of the input nodes 
of the MLP, and 1 is input to the final, 73rd input node in step 
20. 

In MLP phoneme dividing portion 3, the output value for 
phoneme border discrimination is produced through the following 
MLP operation with respect to the incoming characteristic vectors 
in step 21. 

HID(j) = SGMOD( Σ_i IN(i) x WGT_IH(i,j) ), i = 0, 1, ..., 72, j = 0, 1, ..., N-2 
HID(N-1) = 1 
OUT(k) = SGMOD( Σ_j HID(j) x WGT_HO(j,k) ), j = 0, 1, ..., N-1, k = 0, 1 

where IN(i) represents the input of the i th input node; 
OUT(k) is the output of the k th output node; 
WGT_IH(i,j) is the weight connecting the i th input node and 
the j th hidden node; 
WGT_HO(j,k) is the weight connecting the j th hidden node and 
the k th output node; and 
SGMOD represents the aforementioned sigmoid function. 

Value 1 is assigned to the final hidden node, HID(N-1), in 
place of the threshold comparison procedure in the output layer. 
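
By way of illustration only, the MLP operation of step 21 may be 
sketched as follows. The function names and the nested-list 
layout of the weights are assumptions for this example. Because 
the final hidden node is fixed to 1, the sketch reads only the 
first N-1 columns of the input-to-hidden weights. 

```python
import math


def sgmod(x):
    """Bipolar sigmoid (exp(x) - 1) / (exp(x) + 1), computed as tanh(x / 2)."""
    return math.tanh(x / 2.0)


def mlp_forward(inputs, wgt_ih, wgt_ho):
    """Return (OUT(0), OUT(1)) for one 73-value input vector.

    wgt_ih[i][j] connects input node i to hidden node j (j = 0..N-2 is used);
    wgt_ho[j][k] connects hidden node j to output node k (j = 0..N-1).
    """
    num_hidden = len(wgt_ho)
    hid = [sgmod(sum(inputs[i] * wgt_ih[i][j] for i in range(len(inputs))))
           for j in range(num_hidden - 1)]
    hid.append(1.0)  # HID(N-1) = 1 replaces the output-layer threshold
    return tuple(sgmod(sum(hid[j] * wgt_ho[j][k] for j in range(num_hidden)))
                 for k in range(2))
```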

When the output values computed in MLP phoneme dividing 
portion 3 are compared in discriminating the border of phoneme, 
if the first output value OUT(0) is positive, the analyzed frame 
is the border of phoneme. In contrast, if OUT(1) is positive, it 
is determined in step 22 that the frame is not the border of 
phoneme. 

In step 23, it is checked whether the currently analyzed 
frame has reached the frame two frames preceding the final frame 
of the incoming voice. If not, the procedure of inputting the 
characteristic vectors to the MLP input layer is iterated. If the 
currently analyzed frame has reached the frame two frames 
preceding the final frame, the value expressed as a frame number 
indicative of the border of phoneme is output as the final result 
in step 24, and the whole procedure ends. 
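
By way of illustration only, the discrimination of step 22 and 
the iteration of steps 23 and 24 may be sketched as follows. The 
function names, the use of a mapping from frame number to output 
pair, and the starting frame t = 2 (the first frame that has two 
preceding frames) are assumptions for this example. 

```python
def is_border(out):
    """A frame is a phoneme border when the first output OUT(0) is positive."""
    return out[0] > 0


def divide_phonemes(outputs_by_frame, num_frames):
    """Collect the frame numbers judged to be phoneme borders.

    outputs_by_frame maps an analyzed frame number t to its (OUT(0), OUT(1)).
    Analysis runs from t = 2 up to the frame two before the final frame,
    because each input needs frames t-2..t+2.
    """
    borders = []
    for t in range(2, num_frames - 2):
        if is_border(outputs_by_frame[t]):
            borders.append(t)
    return borders
```

The loop stops two frames before the final frame because frames 
t+1 and t+2 are required to form the input for frame t. 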

In implementing a phoneme-unit voice recognition system, the 
present invention enables the precise and effective phoneme 
division preprocessing that is essentially required for phoneme 
recognition based upon phoneme division with respect to the 
divided phoneme segments. 

In addition, the present invention enables automatic voice 
division instead of the conventional manual operation by voice 
experts in constructing the large volume of phoneme-divided voice 
database required in implementing a phoneme-unit voice 
recognition and voice mixing system. This reduces time and cost. 



Although the present invention has been described above with 
reference to the preferred embodiments thereof, those skilled in 
the art will readily appreciate that various modifications and 
substitutions can be made thereto without departing from the 
spirit and scope of the invention as set forth in the appended 
claims. 